224 research outputs found
Recommended from our members
An efficient global resource constrained technique for exploiting instruction level parallelism
A new Global Resource-constrained Percolation (GRiP) scheduling technique is presented for exploiting instruction level parallelism. Other techniques that have been proposed either have been prohibitively expensive in terms of computation or have limited parallelism. The GRiP technique has been implemented and simulation results are presented
Percolation scheduling with resource constraints
This paper presents a new approach to resource-constrained compiler extraction of fine-grain parallelism, targeted towards VLIW supercomputers and, in particular, the IBM VLIW (Very Long Instruction Word) processor. The algorithms described integrate resource limitations into Percolation Scheduling, a global parallelization technique, to deal with resource constraints without sacrificing the generality and completeness of Percolation Scheduling in the process. This is in sharp contrast with previous approaches, which either applied only to conditional-free code or drastically limited the parallelization process by imposing relatively local heuristic resource constraints early in the scheduling process
Incremental tree height reduction for code compaction
This paper introduces a new Tree Height Reduction (THR) technique for code compaction. THR, a well-known parallelizing method, has two interesting properties. First, while known compilation techniques achieve only a constant-factor speed-up, THR achieves a speed-up of O(n / log n). Second, THR can compact code that at first seems uncompactable because of data dependencies. The algorithm presented is incremental, local (at each step it checks only the current operation and its predecessor, rather than the whole expression tree, to see whether compaction is possible), and applicable beyond basic block limits. THR is applied after all other optimization techniques, none of which change the semantics of the code, have been applied. THR does change the semantics of the code, while of course preserving the correctness of the intermediate and final values. The reduction is also controlled by the resources available: if compaction is feasible but there are not enough resources, the algorithm moves on to the next operation. The algorithm produces compacted code suited to any tightly coupled multiprocessor, e.g. Very Long Instruction Word (VLIW) machines. To our knowledge, it is the first local and incremental THR algorithm for code compaction that works across basic block boundaries
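The O(n / log n) figure can be illustrated with the textbook form of tree height reduction: re-associating a left-to-right chain of one associative operator into a balanced tree. The sketch below is not the paper's incremental, resource-aware algorithm; the function names `chain_height` and `balance` are hypothetical, and the transformation is only valid when the operator is associative.

```python
from math import ceil, log2


def chain_height(n_operands):
    """Height of the sequential chain ((a1 + a2) + a3) + ...: n - 1 levels."""
    return max(n_operands - 1, 0)


def balance(operands, op="+"):
    """Re-associate operands into a balanced expression tree.

    Returns (tree, height), where a leaf has height 0 and an inner node is
    the tuple (op, left, right). Assumes `op` is associative.
    """
    if len(operands) == 1:
        return operands[0], 0
    mid = len(operands) // 2
    left, left_height = balance(operands[:mid], op)
    right, right_height = balance(operands[mid:], op)
    return (op, left, right), 1 + max(left_height, right_height)


tree, height = balance(list("abcdefgh"))
print(chain_height(8), height)   # 7 3
print(balance(list("abcd"))[0])  # ('+', ('+', 'a', 'b'), ('+', 'c', 'd'))
```

With enough functional units to evaluate each tree level in one step, the chain needs n - 1 steps and the balanced tree needs ceil(log2 n), which is where a speed-up of order n / log n comes from.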
Fault tolerance in super-scalar and VLIW processors
In this paper, we present a method for utilizing the spare capacity in super-scalar and very long instruction word (VLIW) processors to tolerate functional unit failures. Unlike previous work that was primarily interested in detection of transient faults, we are concerned with more permanent and/or intermittent faults which necessitate processor reconfiguration. Our method utilizes the VLIW compiler or the superscalar scheduler to insert redundant operations whenever idle functional units exist. The results of these redundant operations are used to detect and diagnose functional unit failures. For super-scalar processors, the scheduler can then utilize this information to ensure that operations are performed only on non-faulty units. In VLIW processors, this is equivalent to recompiling the code to run on the remaining non-faulty functional units. Since in certain applications, recompilation may not be possible, we consider two alternative reconfiguration strategies for VLIW processors. These strategies sacrifice storage space and execution time, respectively, in order to reconfigure without recompiling. We present Markov models that describe the behavior of processors using these different approaches and we evaluate their reliabilities. The results show that, while super-scalar and VLIW with recompilation provide the highest reliability, all proposed strategies significantly increase reliability over that of an unprotected processor
Parallelizing non-vectorizable loops for MIMD machines
Parallelizing a loop for MIMD machines can be described as a process of partitioning it into a number of relatively independent subloops. Previous approaches to partitioning non-vectorizable loops were mainly based on iteration pipelining which partitioned a loop based on iteration number and exploited parallelism by overlapping the execution of iterations. However, the amount of parallelism exploited this way is limited because the parallelism inside iterations has been ignored. In this paper, we present a new loop partitioning technique which can exploit both forms of parallelism - inside and across iterations. While inspired by the VLIW approach, our method is designed for more general, asynchronous, MIMD machines. In particular, our schedule takes the cost of communication into account, and attempts to balance it with respect to parallelism. We show our method is correct, efficient, and produces better schedules than previous iteration level approaches
A spill code minimization algorithm for loops
Loops are the main source of parallelism in applications. Finding an optimal register allocation for loops has been an open problem for some time. Here, optimal means minimizing the spills from registers to memory. In this paper we address this problem and present an optimal but exponential-time algorithm which allocates registers to loop bodies such that the spill code is minimal. We also show heuristic modifications to the algorithm which perform in practice as well as the exponential approach. Finally, we examine this algorithm's feasibility in production compilers
N-Dimensional Perfect Pipelining
In this paper, we introduce a technique to parallelize nested loops at the fine grain level. It is a generalization of Perfect Pipelining which was developed to parallelize a single-nested loop at the fine grain level. Previous techniques that can parallelize nested loops, e.g. DOACROSS or Wavefront method, mostly belong to the coarse grain approach. We explain our method, contrast it with the coarse grain techniques, and show the benefits of parallelizing nested loops at the fine grain level
Percolation scheduling for non-VLIW machines
Percolation Scheduling, a technique for compile-time code parallelization, has proven very successful for exploiting fine-grain irregular parallelism in ordinary programs. Currently, this technology is targeted only to VLIW (Very Long Instruction Word) machines, which have the advantages of 'free' synchronization and communication. Shared memory multi-processors can simulate the execution characteristics of VLIW machines with the use of static barriers. Preliminary results show that Percolation Scheduling can be used with good results on this type of architecture by increasing the granularity from operation level to source statement level, removing any redundant synchronization, and providing an efficient implementation of multi-way jumps
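The idea of simulating VLIW lockstep execution with static barriers can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `run_lockstep` is a hypothetical helper, each thread stands in for one processor, and using two barriers per step (one after all reads, one after all writes) is a choice made here to mimic the read-then-write behavior of a single instruction word. It performs none of the redundant-synchronization removal the abstract mentions.

```python
import threading


def run_lockstep(programs, registers):
    """Run one operation list per thread, synchronized step by step.

    programs[t][s] is a function taking the register dict and returning
    (dest, value). Shorter programs idle but still join every barrier.
    """
    n_steps = max(len(p) for p in programs)
    barrier = threading.Barrier(len(programs))

    def worker(prog):
        for step in range(n_steps):
            result = prog[step](registers) if step < len(prog) else None
            barrier.wait()  # every thread has finished reading for this step
            if result is not None:
                dest, value = result
                registers[dest] = value
            barrier.wait()  # every write of this step is now visible

    threads = [threading.Thread(target=worker, args=(p,)) for p in programs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return registers


programs = [
    [lambda r: ("a", r["b"]), lambda r: ("c", r["a"] + r["b"])],
    [lambda r: ("b", r["a"])],
]
print(run_lockstep(programs, {"a": 1, "b": 2, "c": 0}))
# {'a': 2, 'b': 1, 'c': 3}
```

The first step swaps `a` and `b` without a temporary, which only works because both threads read before either writes. That is the 'free' synchronization a VLIW word provides in hardware and that the barriers pay for in software.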
Fine grain software pipelining of non-vectorizable nested loops
This paper presents a new technique to parallelize nested loops at the statement level. It transforms sequential nested loops, either vectorizable or not, into parallel ones. Previously, the wavefront method was used to parallelize non-vectorizable nested loops. However, in order to reduce the complexity of parallelization, the wavefront method regards an iteration as an unbreakable scheduling unit and draws parallelism through iteration overlapping. Our technique takes a statement rather than an iteration as the scheduling unit and exploits parallelism by overlapping the statements in all dimensions. In this paper, we show how this finer grain parallelization can be achieved with reasonable computational complexity, and the effectiveness of the resulting method in exploiting parallelism
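For reference, the wavefront method that this abstract contrasts itself with can be sketched as follows. The example assumes a two-level nest whose iteration (i, j) depends on (i-1, j) and (i, j-1), a dependence pattern chosen here for illustration and not taken from the paper; `wavefronts`, `run_sequential`, and `run_wavefront` are hypothetical names. Each iteration is treated as an unbreakable unit, and iterations on the same anti-diagonal i + j are independent of one another.

```python
def wavefronts(n, m):
    """Group the iterations (i, j) of an n x m nest by anti-diagonal i + j."""
    return [
        [(i, d - i) for i in range(max(0, d - m + 1), min(n, d + 1))]
        for d in range(n + m - 1)
    ]


def _body(grid, i, j):
    # Grid index is shifted by one so that row 0 and column 0 hold the
    # boundary values; iteration (i, j) reads (i-1, j) and (i, j-1).
    grid[i + 1][j + 1] = grid[i][j + 1] + grid[i + 1][j]


def run_sequential(n, m):
    grid = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            _body(grid, i, j)
    return grid


def run_wavefront(n, m):
    grid = [[1] * (m + 1) for _ in range(n + 1)]
    for front in wavefronts(n, m):
        for i, j in front:  # iterations in one front could run in parallel
            _body(grid, i, j)
    return grid


print(len(wavefronts(3, 3)))                         # 5
print(run_sequential(4, 5) == run_wavefront(4, 5))   # True
```

A 3 x 3 nest needs 5 wavefront steps instead of 9 sequential iterations. The statement-level technique described above goes further by overlapping individual statements rather than whole iterations, which this sketch does not attempt.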
A mapping strategy for MIMD computers
In this paper, a heuristic mapping approach which maps parallel programs, described by precedence graphs, to MIMD architectures, described by system graphs, is presented. The complete execution time of a parallel program is used as a measure, and the concept of critical edges is utilized as the heuristic to guide the search for a better initial assignment and subsequent refinement. An important feature is the use of a termination condition of the refinement process. This is based on deriving a lower bound on the total execution time of the mapped program. When this has been reached, no further refinement steps are necessary. The algorithms have been implemented and applied to the mapping of random problem graphs to various system topologies, including hypercubes, meshes, and random graphs. The results show reductions in execution times of the mapped programs of up to 77 percent over random mapping
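The lower-bound termination condition can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not say how its bound is derived, so the code uses two standard bounds (critical-path length and total work divided by processor count), a simple cost model with a fixed communication delay `comm` on cross-processor edges, and a plain greedy refinement in place of the critical-edge heuristic. All function names are hypothetical.

```python
def critical_path_length(cost, preds):
    """Longest cost-weighted path through the precedence graph (a DAG)."""
    memo = {}

    def finish(task):
        if task not in memo:
            memo[task] = cost[task] + max(
                (finish(p) for p in preds.get(task, ())), default=0)
        return memo[task]

    return max(finish(t) for t in cost)


def lower_bound(cost, preds, n_procs):
    """No mapping can beat the critical path or perfectly shared work."""
    work_bound = -(-sum(cost.values()) // n_procs)  # ceiling, integer costs
    return max(critical_path_length(cost, preds), work_bound)


def makespan(assignment, cost, preds, comm=1):
    """Finish time of a mapping; `cost` must be listed in topological order."""
    proc_free, finish = {}, {}
    for task in cost:
        proc = assignment[task]
        ready = max((finish[q] + (comm if assignment[q] != proc else 0)
                     for q in preds.get(task, ())), default=0)
        finish[task] = max(ready, proc_free.get(proc, 0)) + cost[task]
        proc_free[proc] = finish[task]
    return max(finish.values())


def neighbours(assignment, n_procs):
    """All mappings that differ by moving one task to another processor."""
    for task, proc in assignment.items():
        for other in range(n_procs):
            if other != proc:
                yield {**assignment, task: other}


def refine(assignment, cost, preds, n_procs, comm=1):
    """Greedy refinement that stops once the lower bound is reached."""
    bound = lower_bound(cost, preds, n_procs)
    best, best_time = assignment, makespan(assignment, cost, preds, comm)
    improved = True
    while best_time > bound and improved:
        improved = False
        for cand in neighbours(best, n_procs):
            t = makespan(cand, cost, preds, comm)
            if t < best_time:
                best, best_time, improved = cand, t, True
                break
    return best, best_time


cost = {"a": 2, "b": 3, "c": 3, "d": 2}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
start = {t: 0 for t in cost}                        # everything on processor 0
print(lower_bound(cost, preds, 2))                  # 7
print(refine(start, cost, preds, 2, comm=0)[1])     # 7
```

With free communication the refinement reaches the bound of 7 after a single move and stops without examining further candidates. With a nonzero `comm` the bound may be unreachable, in which case this sketch stops when no single move improves the mapping.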